Background and Methodology

For this LBB, I use men’s singles match data from 2000 to 2017, which I obtained from Kaggle. Each row of the data is one match, and contains the match information along with the winning and losing players’ statistics. The goal of this LBB is to explore how top tennis players group together based on their statistics, particularly in matches played in a best-of-5 format, using Principal Component Analysis (PCA) and k-means clustering. In addition, I also explore how clusters of players differ by surface (hard, clay, and grass).

Load Library and Read Data

## [1] 3364   49
## [1] 3307   49
## [1] 3236   49
## [1] 3214   49
## [1] 3277   49
## [1] 3257   49
## [1] 3257   49
## [1] 3152   49
## [1] 3110   49
## [1] 3074   49
## [1] 3058   49
## [1] 3030   49
## [1] 3025   49
## [1] 2959   49
## [1] 2901   49
## [1] 2958   49
## Warning: 2926 parsing failures.
## row col   expected     actual                                        file
##   1  -- 49 columns 69 columns 'atp-matches-dataset//atp_matches_2016.csv'
##   2  -- 49 columns 69 columns 'atp-matches-dataset//atp_matches_2016.csv'
##   3  -- 49 columns 69 columns 'atp-matches-dataset//atp_matches_2016.csv'
##   4  -- 49 columns 69 columns 'atp-matches-dataset//atp_matches_2016.csv'
##   5  -- 49 columns 69 columns 'atp-matches-dataset//atp_matches_2016.csv'
## ... ... .......... .......... ...........................................
## See problems(...) for more details.
## [1] 3004   49
## Warning: 386 parsing failures.
## row col   expected     actual                                        file
##   1  -- 49 columns 69 columns 'atp-matches-dataset//atp_matches_2017.csv'
##   2  -- 49 columns 69 columns 'atp-matches-dataset//atp_matches_2017.csv'
##   3  -- 49 columns 69 columns 'atp-matches-dataset//atp_matches_2017.csv'
##   4  -- 49 columns 69 columns 'atp-matches-dataset//atp_matches_2017.csv'
##   5  -- 49 columns 69 columns 'atp-matches-dataset//atp_matches_2017.csv'
## ... ... .......... .......... ...........................................
## See problems(...) for more details.
## [1] 388  49

Exploratory Data Analysis

Since we are going to use all of the data available to us, we combine the yearly files into one data frame.

## [1] 53571    49
## Observations: 53,571
## Variables: 49
## $ tourney_id         <chr> "2000-717", "2000-717", "2000-717", "2000-717…
## $ tourney_name       <chr> "Orlando", "Orlando", "Orlando", "Orlando", "…
## $ surface            <chr> "Clay", "Clay", "Clay", "Clay", "Clay", "Clay…
## $ draw_size          <dbl> 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 3…
## $ tourney_level      <chr> "A", "A", "A", "A", "A", "A", "A", "A", "A", …
## $ tourney_date       <dbl> 20000501, 20000501, 20000501, 20000501, 20000…
## $ match_num          <chr> "001", "002", "003", "004", "005", "006", "00…
## $ winner_id          <dbl> 102179, 103602, 103387, 101733, 101727, 10318…
## $ winner_seed        <dbl> NA, NA, NA, NA, 4, NA, NA, 5, 6, NA, NA, 3, N…
## $ winner_entry       <chr> NA, "Q", NA, NA, NA, NA, NA, NA, NA, NA, "WC"…
## $ winner_name        <chr> "Antony Dupuis", "Fernando Gonzalez", "Parado…
## $ winner_hand        <chr> "R", "R", "R", "L", "R", "R", "R", "R", "R", …
## $ winner_ht          <dbl> 185, 183, 185, 183, 185, 185, 178, 178, 183, …
## $ winner_ioc         <chr> "FRA", "CHI", "THA", "NED", "AUS", "CZE", "AR…
## $ winner_age         <dbl> 27.18138, 19.75633, 20.88159, 30.04791, 30.07…
## $ winner_rank        <dbl> 113, 352, 103, 107, 74, 92, 120, 79, 89, 125,…
## $ winner_rank_points <dbl> 351, 76, 380, 371, 543, 429, 322, 516, 464, 3…
## $ loser_id           <dbl> 102776, 102821, 102205, 102925, 101826, 10188…
## $ loser_seed         <dbl> 1, NA, NA, 8, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ loser_entry        <chr> NA, "WC", NA, NA, NA, NA, NA, NA, NA, NA, "WC…
## $ loser_name         <chr> "Andrew Ilie", "Cecil Mamiit", "Sebastien Lar…
## $ loser_hand         <chr> "R", "R", "R", "R", "R", "L", "R", "R", "R", …
## $ loser_ht           <dbl> 180, 173, 183, 196, 175, 190, 190, 180, 180, …
## $ loser_ioc          <chr> "AUS", "PHI", "CAN", "USA", "ESP", "AUS", "SU…
## $ loser_age          <dbl> 24.03559, 23.84394, 27.01164, 23.26078, 29.42…
## $ loser_rank         <dbl> 50, 139, 133, 95, 111, 102, 112, 91, 97, 117,…
## $ loser_rank_points  <dbl> 762, 280, 293, 408, 357, 381, 356, 430, 404, …
## $ score              <chr> "3-6 7-6(6) 7-6(4)", "6-2 7-5", "6-1 6-3", "4…
## $ best_of            <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, …
## $ round              <chr> "R32", "R32", "R32", "R32", "R32", "R32", "R3…
## $ minutes            <dbl> 162, 86, 64, 150, 60, 115, 171, 66, 63, 123, …
## $ w_ace              <dbl> 8, 4, 4, 8, 3, 9, 4, 1, 2, 7, 1, 1, 4, 7, 0, …
## $ w_df               <dbl> 1, 2, 1, 6, 0, 4, 11, 1, 2, 2, 4, 3, 1, 0, 0,…
## $ w_svpt             <dbl> 126, 67, 46, 109, 50, 95, 148, 42, 49, 83, 56…
## $ w_1stIn            <dbl> 76, 35, 29, 56, 27, 60, 65, 23, 30, 55, 36, 2…
## $ w_1stWon           <dbl> 56, 25, 23, 43, 22, 43, 54, 19, 26, 42, 26, 2…
## $ w_2ndWon           <dbl> 29, 16, 11, 21, 16, 26, 43, 13, 10, 14, 11, 1…
## $ w_SvGms            <dbl> 16, 10, 8, 15, 9, 16, 18, 8, 9, 14, 9, 9, 11,…
## $ w_bpSaved          <dbl> 14, 4, 0, 9, 1, 5, 9, 0, 3, 3, 2, 3, 4, 7, 3,…
## $ w_bpFaced          <dbl> 15, 6, 0, 12, 1, 6, 12, 0, 4, 6, 3, 4, 5, 10,…
## $ l_ace              <dbl> 13, 0, 2, 4, 0, 11, 5, 0, 2, 3, 2, 2, 3, 4, 3…
## $ l_df               <dbl> 4, 0, 2, 6, 3, 8, 5, 0, 3, 2, 1, 2, 5, 0, 5, …
## $ l_svpt             <dbl> 110, 57, 65, 104, 47, 94, 126, 42, 47, 102, 5…
## $ l_1stIn            <dbl> 59, 24, 39, 57, 28, 48, 70, 25, 25, 62, 27, 3…
## $ l_1stWon           <dbl> 49, 13, 22, 35, 17, 31, 45, 9, 13, 43, 16, 18…
## $ l_2ndWon           <dbl> 31, 17, 10, 24, 10, 29, 36, 10, 10, 14, 12, 8…
## $ l_SvGms            <dbl> 17, 10, 8, 15, 8, 15, 18, 7, 8, 14, 9, 8, 10,…
## $ l_bpSaved          <dbl> 4, 4, 6, 6, 3, 6, 3, 3, 3, 9, 0, 3, 7, 7, 1, …
## $ l_bpFaced          <dbl> 4, 9, 10, 11, 6, 9, 6, 7, 7, 13, 4, 7, 10, 11…

Here, I keep only the matches played in a best-of-5 format, since we want to know each player’s statistics when they play Grand Slam tournaments and other tournaments that share the same format. With that in mind, all matches that are not best of 5 are left out.

## [1] 8739   45
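
The filtering step described above can be sketched in a few lines. This is a minimal Python illustration, not the original code (the analysis itself was run in R): it assumes each match is a dictionary with a numeric `best_of` field, as in the column listing, and the helper name `keep_best_of_five` and the three sample rows are made up for the example.

```python
def keep_best_of_five(matches):
    """Return only the matches played in a best-of-5 format."""
    return [m for m in matches if m.get("best_of") == 5]


# Hypothetical sample rows; only the two columns needed here are included.
matches = [
    {"tourney_name": "Orlando", "best_of": 3},
    {"tourney_name": "Wimbledon", "best_of": 5},
    {"tourney_name": "Davis Cup WG F: ARG vs CRO", "best_of": 5},
]

best_of_five = keep_best_of_five(matches)

# Unique tournament names among the remaining matches, sorted alphabetically.
tournaments = sorted({m["tourney_name"] for m in best_of_five})
print(len(best_of_five), tournaments)
```

Listing the unique tournament names after the filter is a quick way to check which events actually remain in the data.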

Here are all of the tournaments that have matches in the best-of-5 format. Apparently, it is not only the Grand Slams: several Masters tournaments also use this format, and so does the Davis Cup, the ‘World Cup’ of tennis.

##   [1] "US Open"                     "Indian Wells Masters"       
##   [3] "Australian Open"             "Vienna"                     
##   [5] "Miami Masters"               "Wimbledon"                  
##   [7] "Barcelona"                   "Stockholm"                  
##   [9] "Rome Masters"                "Hamburg Masters"            
##  [11] "Monte Carlo Masters"         "Amsterdam"                  
##  [13] "Stuttgart Outdoor"           "Basel"                      
##  [15] "Stuttgart Masters"           "Paris Masters"              
##  [17] "Roland Garros"               "Masters Cup"                
##  [19] "Kitzbuhel"                   "Hong Kong"                  
##  [21] "Stuttgart"                   "Gstaad"                     
##  [23] "Amersfoort"                  "Madrid Masters"             
##  [25] "Beijing Olympics"            "London Olympics"            
##  [27] "Olympics"                    "Us Open"                    
##  [29] "Davis Cup G1 R1: BAR vs ECU" "Davis Cup G1 R2: BRA vs ECU"
##  [31] "Davis Cup G1 R2: CHI vs COL" "Davis Cup G1 R1: CHI vs DOM"
##  [33] "Davis Cup G1 R2: CHN vs UZB" "Davis Cup G1 R1: NZL vs KOR"
##  [35] "Davis Cup G1 R2: ESP vs ROU" "Davis Cup G1 R1: HUN vs ISR"
##  [37] "Davis Cup G1 R2: HUN vs SVK" "Davis Cup G1 R1: POR vs AUT"
##  [39] "Davis Cup G1 R1: ROU vs SLO" "Davis Cup G1 R2: RUS vs NED"
##  [41] "Davis Cup G1 R1: RUS vs SWE" "Davis Cup G1 R2: UKR vs AUT"
##  [43] "Davis Cup G2 R2: ESA vs VEN" "Davis Cup G2 R1: MEX vs GUA"
##  [45] "Davis Cup G2 R2: PER vs MEX" "Davis Cup G2 R3: PER vs VEN"
##  [47] "Davis Cup G2 R1: SRI vs THA" "Davis Cup G2 R2: TPE vs PHI"
##  [49] "Davis Cup G2 R3: TPE vs THA" "Davis Cup G2 R1: BIH vs TUN"
##  [51] "Davis Cup G2 R3: BLR vs DEN" "Davis Cup G2 R1: BUL vs TUR"
##  [53] "Davis Cup G2 R1: EGY vs BLR" "Davis Cup G2 R2: FIN vs DEN"
##  [55] "Davis Cup G2 R2: LAT vs BLR" "Davis Cup G2 R3: LTU vs BIH"
##  [57] "Davis Cup G2 R1: LTU vs NOR" "Davis Cup G2 R2: LTU vs RSA"
##  [59] "Davis Cup G2 R1: MON vs LAT" "Davis Cup G2 R1: RSA vs LUX"
##  [61] "Davis Cup G2 R2: TUR vs BIH" "Davis Cup G2 R1: ZIM vs FIN"
##  [63] "Davis Cup WG F: ARG vs CRO"  "Davis Cup WG R1: ARG vs POL"
##  [65] "Davis Cup WG R1: CAN vs FRA" "Davis Cup WG R1: CRO vs BEL"
##  [67] "Davis Cup WG SF: FRA vs CRO" "Davis Cup WG QF: FRA vs CZE"
##  [69] "Davis Cup WG SF: GBR vs ARG" "Davis Cup WG R1: GBR vs JPN"
##  [71] "Davis Cup WG QF: GBR vs SRB" "Davis Cup WG R1: GER vs CZE"
##  [73] "Davis Cup WG QF: ITA vs ARG" "Davis Cup WG R1: SRB vs KAZ"
##  [75] "Davis Cup WG R1: SUI vs ITA" "Davis Cup WG R1: USA vs AUS"
##  [77] "Davis Cup WG QF: USA vs CRO" "Davis Cup WG PO: AUS vs SVK"
##  [79] "Davis Cup WG PO: BEL vs BRA" "Davis Cup WG PO: CAN vs CHI"
##  [81] "Davis Cup WG PO: ESP vs IND" "Davis Cup WG PO: GER vs POL"
##  [83] "Davis Cup WG PO: JPN vs UKR" "Davis Cup WG PO: KAZ vs RUS"
##  [85] "Davis Cup WG PO: SUI vs UZB" "Davis Cup G1 R1: DOM vs CHI"
##  [87] "Davis Cup G1 R1: PER vs ECU" "Davis Cup G1 R1: NZL vs IND"
##  [89] "Davis Cup G1 R1: UZB vs KOR" "Davis Cup G1 R1: ISR vs POR"
##  [91] "Davis Cup G1 R1: POL vs BIH" "Davis Cup G1 R1: ROU vs BLR"
##  [93] "Davis Cup G1 R2: SVK vs HUN" "Davis Cup G2 R1: BOL vs ESA"
##  [95] "Davis Cup G2 R1: INA vs PHI" "Davis Cup G2 R1: KUW vs THA"
##  [97] "Davis Cup G2 R1: VIE vs HKG" "Davis Cup G2 R1: EST vs RSA"
##  [99] "Davis Cup G2 R1: FIN vs GEO" "Davis Cup G2 R1: MAR vs DEN"
## [101] "Davis Cup G2 R1: MON vs SLO" "Davis Cup G2 R1: NOR vs LAT"
## [103] "Davis Cup G2 R1: SWE vs TUN" "Davis Cup G2 R1: TUR vs CYP"
## [105] "Davis Cup WG R1: ARG vs ITA" "Davis Cup WG R1: BEL vs GER"
## [107] "Davis Cup WG R1: CAN vs GBR" "Davis Cup WG R1: CZE vs AUS"
## [109] "Davis Cup WG R1: ESP vs CRO" "Davis Cup WG R1: RUS vs SRB"
## [111] "Davis Cup WG R1: SUI vs USA"

Next, we want to know how many sets each player won in each of those matches.

Out of curiosity, and to strengthen the clustering, we also want to know how many tiebreaks each player won in each match.
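
Both counts can be derived from the `score` column. The Python sketch below is one possible way to do it, not the original R code. It assumes that scores are written from the winner’s point of view (as in "3-6 7-6(6) 7-6(4)"), that a 7-6 set was decided by a tiebreak, and that incomplete sets (for example before a retirement) should not be counted. The function name `parse_score` and the returned keys are hypothetical.

```python
import re

# One set looks like "6-4" or "7-6(4)"; the optional number in parentheses
# is the tiebreak score and is not needed for counting.
SET_PATTERN = re.compile(r"^(\d+)-(\d+)(?:\((\d+)\))?$")


def parse_score(score):
    """Count sets and tiebreaks won by the match winner and the loser."""
    result = {"winner_sets": 0, "loser_sets": 0, "winner_tb": 0, "loser_tb": 0}
    for token in score.split():
        match = SET_PATTERN.match(token)
        if match is None:
            continue  # skips markers such as "RET" or "W/O"
        w_games, l_games = int(match.group(1)), int(match.group(2))
        high, low = max(w_games, l_games), min(w_games, l_games)
        is_tiebreak = (high, low) == (7, 6)
        # Ignore sets that were not completed.
        if high < 6 or (high - low < 2 and not is_tiebreak):
            continue
        side = "winner" if w_games > l_games else "loser"
        result[side + "_sets"] += 1
        if is_tiebreak:
            result[side + "_tb"] += 1
    return result


print(parse_score("3-6 7-6(6) 7-6(4)"))
```

Treating every 7-6 set as a tiebreak keeps the parser usable even when the tiebreak score in parentheses is missing from a row.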

Lastly, we add match statistics that give us a clue about the characteristics of each player in each match.

## [1] 8739   53

Data Analysis

We will use PCA to see how the players cluster based on their statistics from 2000 to 2017. Before doing that, we first create another data frame that gathers the best players based on the statistics above. This data frame is what makes the PCA and the clustering possible. Each row in the best-player data frame is a player who has been ranked in the top 30 at some point from 2000 to 2017 and has played more than 50 best-of-5 singles matches. The variables in the player data frame include:

- player_1stServeWon_p: first serve win percentage
- player_2ndServeWon_p: second serve win percentage
- player_ace_p: ace percentage
- player_df_p: double fault percentage
- games_per_match: average number of games per match
- player_tbWon_p: tiebreak won percentage
- player_bpSaved_p: breakpoint saved percentage
- opponent_bpBreak_p: breaking opponent’s serves percentage
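
To make these variables concrete, here is a minimal Python sketch that computes several of them for the winner of a single match. The exact formulas used in the analysis are not shown in this write-up, so the denominators below are assumptions (for example, second-serve points are taken as `w_svpt - w_1stIn`), and the per-player values would in practice be aggregated over all of a player’s matches. The input numbers are the first row of the column listing above; the function names are hypothetical.

```python
def ratio(numerator, denominator):
    """Safe division: returns None when the denominator is zero."""
    return numerator / denominator if denominator else None


def winner_serve_stats(row):
    """Serve and return percentages for the match winner (assumed formulas)."""
    second_serve_points = row["w_svpt"] - row["w_1stIn"]
    breaks_of_opponent = row["l_bpFaced"] - row["l_bpSaved"]
    return {
        "player_1stServeWon_p": ratio(row["w_1stWon"], row["w_1stIn"]),
        "player_2ndServeWon_p": ratio(row["w_2ndWon"], second_serve_points),
        "player_ace_p": ratio(row["w_ace"], row["w_svpt"]),
        "player_df_p": ratio(row["w_df"], row["w_svpt"]),
        "player_bpSaved_p": ratio(row["w_bpSaved"], row["w_bpFaced"]),
        "opponent_bpBreak_p": ratio(breaks_of_opponent, row["l_bpFaced"]),
    }


# First match in the combined data frame (values taken from the listing above).
first_row = {
    "w_ace": 8, "w_df": 1, "w_svpt": 126, "w_1stIn": 76, "w_1stWon": 56,
    "w_2ndWon": 29, "w_bpSaved": 14, "w_bpFaced": 15,
    "l_bpSaved": 4, "l_bpFaced": 4,
}

stats = winner_serve_stats(first_row)
for name, value in stats.items():
    print(name, value)
```

Returning `None` for a zero denominator avoids a division error in matches where, for instance, a player never faced a break point.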

Best Player

Principal Component Analysis (PCA)

Now, let’s run a PCA on the best-player data frame and print the result.

## Standard deviations (1, .., p=8):
## [1] 1.9102289 1.2302225 1.0611106 0.7386093 0.7067697 0.6180102 0.4708836
## [8] 0.2507734
## 
## Rotation (n x k) = (8 x 8):
##                              PC1        PC2         PC3         PC4
## player_1stServeWon_p -0.47089898  0.0602451 -0.06746803  0.19733797
## player_2ndServeWon_p -0.29728112  0.3986285 -0.40460827  0.25413178
## player_ace_p         -0.48455081 -0.1357800  0.03785360  0.14224914
## player_df_p          -0.07840918 -0.2370925 -0.85531772 -0.21417214
## games_per_match      -0.30279238 -0.4811566  0.07126199 -0.39489208
## player_tbWon_p       -0.23930376  0.5325628  0.09954829 -0.78254249
## player_bpSaved_p     -0.42618417  0.2455993  0.18478353  0.24376630
## opponent_bpBreak_p    0.34403218  0.4328544 -0.22272367  0.05133928
##                              PC5         PC6          PC7          PC8
## player_1stServeWon_p  0.42368638 -0.24633822  0.310070903  0.628326221
## player_2ndServeWon_p -0.53577598  0.36754705  0.317812005 -0.035738034
## player_ace_p          0.33097350 -0.07777663  0.215020006 -0.750520037
## player_df_p           0.16932865 -0.04317701 -0.359811435 -0.014153121
## games_per_match      -0.56135086 -0.37986407  0.227414084  0.068252756
## player_tbWon_p        0.15444784  0.10927714 -0.008833189 -0.030891465
## player_bpSaved_p     -0.23147323 -0.28328655 -0.728441241  0.004827063
## opponent_bpBreak_p   -0.05166739 -0.74821368  0.211499081 -0.186580530

Let’s plot the PCA data using the biplot() function.

To make the plot clearer, we can exclude the player names from the plot by using fviz_pca_biplot(). We can also see from this plot that PCA works well on our data, since PC1 and PC2 already capture 64.5% of the variation.
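
The 64.5% figure can be checked directly from the standard deviations printed in the PCA result: the share of variance captured by a component is its squared standard deviation divided by the sum of all squared standard deviations. The following is a small Python sketch of that arithmetic, using the eight values from the output above.

```python
from itertools import accumulate

# Standard deviations of the eight principal components, copied from the output.
sdev = [1.9102289, 1.2302225, 1.0611106, 0.7386093,
        0.7067697, 0.6180102, 0.4708836, 0.2507734]

variances = [s ** 2 for s in sdev]
total_variance = sum(variances)

# Share of variance per component, and the running (cumulative) share.
proportion = [v / total_variance for v in variances]
cumulative = list(accumulate(proportion))

for i, (p, c) in enumerate(zip(proportion, cumulative), start=1):
    print(f"PC{i}: {p:.3f} (cumulative {c:.3f})")
```

The total variance comes out very close to 8, which is what one would expect if the eight variables were standardized before the PCA.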

To see the correlations and relationships between the variables, as well as the outliers, let’s bring back the players’ names.

Conclusion from the PCA Plot:

From the vectors, we can see that:

- The more break points one converts against his opponent (opponent_bpBreak_p), the fewer games he has to play per match (games_per_match).
- If one is good at serving (player_1stServeWon_p and player_2ndServeWon_p), he will also be good at winning tiebreaks (player_tbWon_p) and saving break points (player_bpSaved_p). However, he isn’t necessarily good at breaking his opponent’s serve.
- The ones who are good at serving don’t necessarily have the most aces.

From the points, we can see that:

- The 2 outlier points in the lower left corner are I. Karlovic (most career aces) and J. Isner (most aces in a tournament). Isner also plays the most games per match in this dataset.
- The 2 outlier points in the upper left corner are M. Raonic (one of the best servers, all-court style) and P. Sampras (precise and powerful serve, all-court player).
- The 4 points at the top (around the vertical axis) are Nadal, Djokovic, Agassi and Murray. Federer is a little bit to the left. It seems they are all good at both winning service games and breaking their opponent’s serve.

K-Means Clustering

Before we create clusters of players, we should find the optimal number of clusters using the elbow method below:

Apparently, 5 clusters is the optimal number we can get. Now, let’s create the plot.
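
For readers who want to see what the elbow method computes, here is a self-contained Python sketch. It is not the original R code and it does not use the tennis data: it runs a simple k-means (with a deterministic farthest-first initialization, chosen here only to make the example reproducible) on twelve made-up 2D points that form three obvious groups, and records the total within-cluster sum of squares (WSS) for each k. The elbow is the k after which WSS stops dropping sharply.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_first_centers(points, k):
    """Deterministic start: each new center is the point farthest from the others."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(squared_distance(p, c) for c in centers))
        )
    return centers


def kmeans_wss(points, k, max_iter=100):
    """Run Lloyd's algorithm and return the within-cluster sum of squares."""
    centers = farthest_first_centers(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centers are final
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return sum(squared_distance(p, centers[lab]) for p, lab in zip(points, labels))


# Toy data: three well-separated groups of four points each.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (0, 10), (1, 10), (0, 11), (1, 11)]

wss = {k: kmeans_wss(points, k) for k in range(1, 5)}
for k, value in wss.items():
    print(k, round(value, 2))
```

On this toy data the sharp drop ends at k = 3, because the points were built as three groups. On the player data the same kind of curve is what suggests the number of clusters to use.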

Conclusion from Clustering

Looking at these clusters, we can conclude that:

  • Red cluster: have a high serve win percentage, are good at breaking their opponent’s serve, win tiebreaks, and are good at saving break points
  • Purple cluster: win a high share of first-serve points (aces), but need a lot of games to win a match
  • Yellow cluster: make a lot of double faults and need a lot of games to win a match
  • Blue cluster: are good at breaking their opponent’s serve
 




A work by Faris Dzikrur Rahman